feat: add more Spark Expressions #1724
Conversation
Thank you very much for looking into this! 🙏
Hey @EdAbati, took an initial swing at implementing the …
Thanks for updating! :)
I left another couple of small comments.
IMO we can just merge `all`, `any` and `null_count` here and worry about the rest in a follow-up.
```python
def _all(_input: Column) -> Column:
    from pyspark.sql import functions as F  # noqa: N812

    return F.bool_and(_input)

return self._from_call(_all, "all", returns_scalar=True)
```
We simplified the other methods a bit; we can refactor this as:

```diff
-def _all(_input: Column) -> Column:
-    from pyspark.sql import functions as F  # noqa: N812
-
-    return F.bool_and(_input)
-
-return self._from_call(_all, "all", returns_scalar=True)
+from pyspark.sql import functions as F  # noqa: N812
+
+return self._from_call(F.bool_and, "all", returns_scalar=True)
```
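This works because `F.bool_and` already has the `Column -> Column` shape that `_from_call` expects, so the intermediate `_all` wrapper adds nothing.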
```python
def _any(_input: Column) -> Column:
    from pyspark.sql import functions as F  # noqa: N812

    return F.bool_or(_input)

return self._from_call(_any, "any", returns_scalar=True)
```
```diff
-def _any(_input: Column) -> Column:
-    from pyspark.sql import functions as F  # noqa: N812
-
-    return F.bool_or(_input)
-
-return self._from_call(_any, "any", returns_scalar=True)
+from pyspark.sql import functions as F  # noqa: N812
+
+return self._from_call(F.bool_or, "any", returns_scalar=True)
```

Same as above.
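For reference, a minimal standalone sketch of what these two aggregations do (assuming PySpark 3.5+, where `bool_and`/`bool_or` are available; the toy data is made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # noqa: N812

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(True,), (False,), (True,)], ["flag"])

# bool_and aggregates to True only if every value is True; bool_or
# aggregates to True if at least one value is True.
df.select(
    F.bool_and("flag").alias("all_flags"),  # False
    F.bool_or("flag").alias("any_flag"),    # True
).show()
```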
```python
return self._from_call(_null_count, "null_count", returns_scalar=True)

def replace_strict(
```
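The `_null_count` helper itself is not visible in this hunk; a minimal sketch of how such a helper could look, following the same `Column -> Column` pattern as `_all` and `_any` (an assumption, not necessarily the PR's actual code):

```python
def _null_count(_input: Column) -> Column:
    from pyspark.sql import functions as F  # noqa: N812

    # `when` yields 1 for null values and null otherwise, and `count`
    # ignores nulls, so this counts the null rows.
    return F.count(F.when(F.isnull(_input), 1))
```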
I am tempted to say that this should not be implemented for now and should just raise a `NotImplementedError` (as we do in Dask).
We would need to be able to access the dataframe (and collect the results) to get the distinct values of the column.
@FBruzzesi and @MarcoGorelli any thoughts?
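A minimal sketch of such a stub (the signature here is hypothetical):

```python
def replace_strict(self, old, new, *, return_dtype):
    # Hypothetical stub: verifying that every value in the column is
    # covered by `old` would require collecting the column's distinct
    # values, which we cannot do lazily here.
    msg = "`replace_strict` is not yet supported for the PySpark backend."
    raise NotImplementedError(msg)
```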
> I am tempted to say that this should not be implemented for now and just raise a NotImplementedError. (as we do in Dask)

Sure, we can evaluate if and how to support `replace_strict` later on. Super excited to ship the rest for now 🙌🏼
Agree!
If you have comments or can explain your changes, please do so below
Added the following methods to `SparkLikeExpr` and `SparkLikeNamespace`:

- `any`
- `all`
- `null_count`
- `any_horizontal`
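For the horizontal case, one plausible way to express a row-wise "any" in PySpark is to OR the columns together (a hypothetical sketch, not necessarily how this PR implements `any_horizontal`, and it glosses over null handling):

```python
from functools import reduce
from operator import or_

from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # noqa: N812

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(True, False), (False, False)], ["a", "b"])

# Row-wise "any": OR the boolean columns together.
any_col = reduce(or_, (F.col(name) for name in ("a", "b")))
df.withColumn("any_horizontal", any_col).show()
```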
Copied the respective tests over. I couldn't run them without Java on my machine, but running them locally on their respective test datasets worked for me.
Let me know if anything needs to be updated!